CS267 Project Proposal: Performance and Scalability Characteristics of Spark

Author

  • Mosharaf Chowdhury
Abstract

In recent years, a new model of performing data-parallel computations on clusters of unreliable machines (e.g., MapReduce [1], Dryad [2]) has become widely popular. These systems achieve their scalability and fault tolerance by providing a programming model where users create acyclic data flow graphs to pass input data through a set of operators. This allows the underlying system to schedule jobs and to react to faults without user intervention. While this data flow programming model is useful for a large class of applications, applications that reuse a working set of data across multiple parallel operations cannot be expressed efficiently as acyclic data flows. Such iterative jobs are extremely prevalent in machine learning algorithms that repeatedly apply a function to the same dataset to optimize a parameter (e.g., through gradient descent). While each iteration can be expressed as a MapReduce/Dryad job, each job must reload the data from disk, incurring a significant performance penalty.
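The reload penalty described above can be illustrated with a minimal sketch in plain Python. It is not Spark, MapReduce, or Dryad code: `load_points`, `fit_reloading`, and `fit_cached` are hypothetical names, the dataset is synthetic, and the "disk read" is only simulated by a call counter. The sketch assumes a one-parameter least-squares problem optimized by gradient descent, and contrasts running each iteration as an independent job with reusing one in-memory working set.

```python
import random


def load_points(n=200, seed=0):
    """Hypothetical stand-in for an expensive read of the dataset from disk."""
    load_points.calls += 1  # count how often the data is "reloaded"
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, 0.1)  # synthetic data: y is roughly 3x
        points.append((x, y))
    return points


load_points.calls = 0


def gradient(w, points):
    """Map each point to its squared-error gradient, then reduce by averaging."""
    return sum(2.0 * (w * x - y) * x for x, y in points) / len(points)


def fit_reloading(iterations, lr=0.5):
    """Each iteration behaves like a separate job that reloads its input."""
    w = 0.0
    for _ in range(iterations):
        points = load_points()
        w -= lr * gradient(w, points)
    return w


def fit_cached(iterations, lr=0.5):
    """The working set is loaded once and reused across all iterations."""
    points = load_points()
    w = 0.0
    for _ in range(iterations):
        w -= lr * gradient(w, points)
    return w


if __name__ == "__main__":
    load_points.calls = 0
    w_reload = fit_reloading(50)
    print("reloading:", w_reload, "loads =", load_points.calls)

    load_points.calls = 0
    w_cached = fit_cached(50)
    print("cached:   ", w_cached, "loads =", load_points.calls)
```

Both variants compute the same parameter; only the number of loads differs (one per iteration versus one in total), which is the cost the abstract attributes to expressing each iteration as a separate acyclic job.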

Related resources

DduP - Towards a Deduplication Framework Utilising Apache Spark

This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large scale deduplication problems on arbitrary data tuples. DduP tries to bridge the gap between big data, high performance and duplicate detection. At the moment a first prototype exists but the overall project status is work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...

Scalability Potential of BWA DNA Mapping Algorithm on Apache Spark

This paper analyzes the scalability potential of embarrassingly parallel genomics applications using the Apache Spark big data framework and compares their performance with native implementations as well as with Apache Hadoop scalability. The paper uses the BWA DNA mapping algorithm as an example due to its good scalability characteristics and due to the large data files it uses as input. Resul...

The Effects of Ethanol–gasoline Blend on Performance and Exhaust Emission Characteristics of Spark Ignition Engines

The effects of unleaded gasoline and unleaded gasoline–ethanol blends on engine performance and pollutant emissions were investigated experimentally in a single cylinder, four-stroke spark-ignition engine with variable engine speeds (2600–3500 rpm). Four different blends on a volume basis were applied. These are E0 (0% ethanol + 100% unleaded gasoline), E3 (3% ethanol + 97% unleaded gasoline), ...

Experimental Investigation on Hydrous Methanol Fueled HCCI Engine Using Spark Assisted Method

The present work investigates the performance and emission characteristics of hydrous methanol fuelled Homogeneous Charge Compression Ignition (HCCI) engine. In the present work a regular diesel engine has been modified to work as HCCI engine. Hydrous methanol is used with 15% water content in this HCCI engine and its performance and emission behavior is documented. A spark plug is used for ass...

Performance and Scalability of Broadcast in Spark

Although the MapReduce programming model has so far been highly successful, not all applications are well suited to this model. Spark bridges this gap by providing seamless support for iterative and interactive jobs that are hard to express using the acyclic data flow model pioneered by MapReduce. While benchmarking Spark, we identified that the default broadcast mechanism implemented in the Sp...

Publication date: 2010